An application of KMeans classifier on the stock markets

Recently I discovered an application of KMeans classifier. In particular, I was surprised that this classifier can be used to cluster stocks that don't have an apparent relation, by sector or other means, but that tend to move together.

I wanted to explore this concept more in depth here. For this reason, I created this notebook that:

  1. Pull data from a free provider using known but unrelated stocks
  2. Visualize price trends as well as moving averages (called Bollinger Bands in the financial world)
  3. Apply clustering to group stocks from different sectors.

Structure of the notebook

  1. Import libraries
  2. Pull the data and do a very bare bone EDA (mostly to check there are no missing values)
  3. Apply some visualization using plotly and generate Bollinger Bands at different standard deviations
  4. Train KMeans and visualize resulting cluster.

1. Import libraries

In [1]:
# data manipulation and cleaning
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
import datetime
from functools import reduce

# library to help importing financial data from Tiingo
import requests
from tiingo import TiingoClient


# visualisation
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
from matplotlib import cm
import plotly.graph_objects as go
from plotly import tools
import plotly.io as pio
from plotly.io import to_html
import seaborn as sns
from IPython.display import HTML
import psutil

# modelling
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.manifold import TSNE

plt.style.use('ggplot')
%matplotlib inline
In [2]:
# remove code toggle
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Out[2]:
To toggle on/off the raw code, click here.

2. Pull the data

To get the financial data we can connect to the REST API from Tiingo. To connect to their API, we can create a free account to receive an auth token. Below, a working example using my persona auth-token.

In [3]:
# here we evaluate the parameters that we want to use. Needed: Auth token and timeframe
params = {
    'Content-Type': 'application/json', 
    'Authorization' : 'Token 3735ce1f44efa040f55dbeae7092620f5b4fc4fb', # substitute your auth-token
}

# this is the url that query the specific ticker
url = 'https://api.tiingo.com/tiingo/daily/aapl/prices?startDate=2019-01-02&token=3735ce1f44efa040f55dbeae7092620f5b4fc4fb'


response = requests.get(url = url, headers=params)
In [4]:
# Extract JSON data from the response
data = response.json()

# Load data to a data frame
df = pd.DataFrame(data)

# View the data's dtypes
df.info()

df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 326 entries, 0 to 325
Data columns (total 13 columns):
date           326 non-null object
close          326 non-null float64
high           326 non-null float64
low            326 non-null float64
open           326 non-null float64
volume         326 non-null int64
adjClose       326 non-null float64
adjHigh        326 non-null float64
adjLow         326 non-null float64
adjOpen        326 non-null float64
adjVolume      326 non-null int64
divCash        326 non-null float64
splitFactor    326 non-null float64
dtypes: float64(10), int64(2), object(1)
memory usage: 33.2+ KB
Out[4]:
date close high low open volume adjClose adjHigh adjLow adjOpen adjVolume divCash splitFactor
0 2019-01-02T00:00:00.000Z 157.92 158.8500 154.23 154.89 37039737 155.201765 156.115757 151.575280 152.223920 37039737 0.0 1.0
1 2019-01-03T00:00:00.000Z 142.19 145.7200 142.00 143.98 91312195 139.742522 143.211761 139.555792 141.501711 91312195 0.0 1.0
2 2019-01-04T00:00:00.000Z 148.26 148.5499 143.80 144.53 58607070 145.708040 145.992950 141.324809 142.042244 58607070 0.0 1.0
3 2019-01-07T00:00:00.000Z 147.93 148.8300 145.90 148.70 54777764 145.383720 146.268229 143.388662 146.140467 54777764 0.0 1.0
4 2019-01-08T00:00:00.000Z 150.75 151.8200 148.52 149.56 41025314 148.155181 149.206763 145.963565 146.985664 41025314 0.0 1.0

Although the above snippet pull the data we are looking for, it is too cumbersome to write down all the different urls. Luckily for me, Tiingo website also provides an useful Python library that streamlines most of the operations we will need.

I selected 37 random stock tickers. I didn't use any specific logic to pick these, the only requirements was that they didn't all belong to the same industry sector.

In [5]:
# initialize a session
config = {}

# To reuse the same HTTP Session across API calls (and have better performance), include a session key.
config['session'] = True

# If you don't have your API key as an environment variable,
# pass it in via a configuration dictionary.
config['api_key'] = '3735ce1f44efa040f55dbeae7092620f5b4fc4fb'

# Initialize
client = TiingoClient(config)
In [6]:
#Get a pd.DataFrame for a list of symbols for a specified metric_name (default is adjClose if no
#metric_name is specified):

ticker_history_open = client.get_dataframe(
    ['AMZN',# Amazon
     'WBA', # 'Walgreen'
     'NOC', #'Northrop Grumman'
     'LMT', # 'Lockheed Martin'
     'MCD', # 'McDonalds'
     'INTC', # 'Intel'
     'NAV', # 'Navistar'
     'IBM', #'IBM'
     'TXN', # 'Texas Instruments'
     'GE', # 'General Electrics'
     'SYMC', # 'Symantec'
     'PEP', # 'Pepsi'
     'KO', # 'Coca Cola':
     'JNJ', # 'Johnson & Johnson'
     'TM', # 'Toyota'
     'HMC', # 'Honda'
     'SNE', # 'Sony':
     'XOM', # 'Exxon'
     'CVX', #'Chevron'
     'VLO', # 'Valero Energy'
     'F', # 'Ford'
     'GOOGL', # google
     'AAPL', # apple
     'MSFT', # microsoft
     'PFE', # Pfizer
     'MRK', # Abbvie
     'BA', # Visa
     'COST', # MasterCard
     'C', # CitiGroup
     'DB', # Deutsche Bank
     'NKE', # Nike
     'DIS', # Disney
     'CAT', # Caterpillar
     'SPY', # S&P index
     'LH', # Lab Corp
     'RCL', # Royal Caribbean
     'NNN', # National Retail Properties
    ],                                  
    frequency='daily',                
    metric_name='open', # opening prices
    startDate='2005-01-01',
    endDate='2017-12-31')

ticker_history_close = client.get_dataframe(
    ['AMZN',# Amazon
     'WBA', # 'Walgreen'
     'NOC', #'Northrop Grumman'
     'LMT', # 'Lockheed Martin'
     'MCD', # 'McDonalds'
     'INTC', # 'Intel'
     'NAV', # 'Navistar'
     'IBM', #'IBM'
     'TXN', # 'Texas Instruments'
     'GE', # 'General Electrics'
     'SYMC', # 'Symantec'
     'PEP', # 'Pepsi'
     'KO', # 'Coca Cola':
     'JNJ', # 'Johnson & Johnson'
     'TM', # 'Toyota'
     'HMC', # 'Honda'
     'SNE', # 'Sony':
     'XOM', # 'Exxon'
     'CVX', #'Chevron'
     'VLO', # 'Valero Energy'
     'F', # 'Ford'
     'GOOGL', # google
     'AAPL', # apple
     'MSFT', # microsoft
     'PFE', # Pfizer
     'MRK', # Abbvie
     'BA', # Visa
     'COST', # MasterCard
     'C', # CitiGroup
     'DB', # Deutsche Bank
     'NKE', # Nike
     'DIS', # Disney
     'CAT', # Caterpillar
     'SPY', # S&P index
     'LH', # Lab Corp
     'RCL', # Royal Caribbean
     'NNN', # National Retail Properties
    ],                                  
    frequency='daily',                
    metric_name='close', # closing prices
    startDate='2005-01-01',
    endDate='2017-12-31')
/Users/manfredimiraula/.virtualenvs/ds/lib/python3.7/site-packages/tiingo/api.py:270: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.


In [7]:
ticker_history_open.head()
Out[7]:
AMZN WBA NOC LMT MCD INTC NAV IBM TXN GE ... COST C DB NKE DIS CAT SPY LH RCL NNN
date
2005-01-03 00:00:00+00:00 44.95 39.75 54.05 55.40 31.60 23.64 44.10 98.97 24.93 36.71 ... 48.40 49.00 88.93 90.60 27.81 96.10 121.56 49.83 54.65 20.60
2005-01-04 00:00:00+00:00 42.62 40.35 52.80 53.98 31.74 23.19 43.96 97.74 24.19 36.60 ... 48.19 48.28 87.70 90.67 28.22 96.05 120.46 48.70 54.30 20.41
2005-01-05 00:00:00+00:00 41.57 40.09 52.35 53.29 32.08 22.46 42.05 96.60 23.50 36.15 ... 47.20 48.05 85.74 89.65 27.75 94.25 118.74 48.10 53.67 19.98
2005-01-06 00:00:00+00:00 41.81 41.05 53.51 54.63 31.84 22.58 41.55 96.54 22.98 36.03 ... 48.86 48.80 87.73 88.27 27.43 92.40 118.44 47.92 53.15 18.95
2005-01-07 00:00:00+00:00 41.33 41.19 53.13 54.51 32.03 22.72 41.40 96.50 23.00 36.36 ... 47.83 49.10 88.68 88.32 27.50 93.67 118.97 48.10 53.50 19.28

5 rows × 37 columns

In [8]:
ticker_history_close.head()
Out[8]:
AMZN WBA NOC LMT MCD INTC NAV IBM TXN GE ... COST C DB NKE DIS CAT SPY LH RCL NNN
date
2005-01-03 00:00:00+00:00 44.52 40.38 53.65 54.21 31.82 23.07 43.89 97.75 24.18 36.59 ... 47.90 48.27 88.42 90.69 27.85 95.07 120.30 48.70 54.15 20.31
2005-01-04 00:00:00+00:00 42.14 40.09 52.50 53.29 32.08 22.61 42.79 96.70 23.60 36.15 ... 47.04 47.86 86.20 89.40 27.55 94.02 118.83 48.49 53.61 20.08
2005-01-05 00:00:00+00:00 41.77 41.12 53.30 54.63 31.83 22.39 41.30 96.50 22.89 35.93 ... 47.22 48.46 85.90 88.33 27.40 92.22 118.01 47.88 52.85 19.12
2005-01-06 00:00:00+00:00 41.05 41.14 53.16 54.51 32.08 22.46 41.40 96.20 22.80 36.22 ... 47.90 48.93 87.85 88.22 27.44 93.58 118.61 48.17 53.25 19.12
2005-01-07 00:00:00+00:00 42.32 41.50 53.05 54.77 31.87 22.80 40.43 95.78 22.92 36.00 ... 47.49 48.65 87.20 87.70 27.17 93.38 118.44 47.85 53.06 18.88

5 rows × 37 columns

In [9]:
ticker_history_close.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3272 entries, 2005-01-03 00:00:00+00:00 to 2017-12-29 00:00:00+00:00
Data columns (total 37 columns):
AMZN     3272 non-null float64
WBA      3272 non-null float64
NOC      3272 non-null float64
LMT      3272 non-null float64
MCD      3272 non-null float64
INTC     3272 non-null float64
NAV      3272 non-null float64
IBM      3272 non-null float64
TXN      3272 non-null float64
GE       3272 non-null float64
SYMC     3272 non-null float64
PEP      3272 non-null float64
KO       3272 non-null float64
JNJ      3272 non-null float64
TM       3272 non-null float64
HMC      3272 non-null float64
SNE      3272 non-null float64
XOM      3272 non-null float64
CVX      3272 non-null float64
VLO      3272 non-null float64
F        3272 non-null float64
GOOGL    3272 non-null float64
AAPL     3272 non-null float64
MSFT     3272 non-null float64
PFE      3272 non-null float64
MRK      3272 non-null float64
BA       3272 non-null float64
COST     3272 non-null float64
C        3272 non-null float64
DB       3272 non-null float64
NKE      3272 non-null float64
DIS      3272 non-null float64
CAT      3272 non-null float64
SPY      3272 non-null float64
LH       3272 non-null float64
RCL      3272 non-null float64
NNN      3272 non-null float64
dtypes: float64(37)
memory usage: 971.4 KB
In [10]:
display(ticker_history_open.isnull().sum()),
display(ticker_history_close.isnull().sum())
AMZN     0
WBA      0
NOC      0
LMT      0
MCD      0
INTC     0
NAV      0
IBM      0
TXN      0
GE       0
SYMC     0
PEP      0
KO       0
JNJ      0
TM       0
HMC      0
SNE      0
XOM      0
CVX      0
VLO      0
F        0
GOOGL    0
AAPL     0
MSFT     0
PFE      0
MRK      0
BA       0
COST     0
C        0
DB       0
NKE      0
DIS      0
CAT      0
SPY      0
LH       0
RCL      0
NNN      0
dtype: int64
AMZN     0
WBA      0
NOC      0
LMT      0
MCD      0
INTC     0
NAV      0
IBM      0
TXN      0
GE       0
SYMC     0
PEP      0
KO       0
JNJ      0
TM       0
HMC      0
SNE      0
XOM      0
CVX      0
VLO      0
F        0
GOOGL    0
AAPL     0
MSFT     0
PFE      0
MRK      0
BA       0
COST     0
C        0
DB       0
NKE      0
DIS      0
CAT      0
SPY      0
LH       0
RCL      0
NNN      0
dtype: int64

We have now the open and close prices for each stock we are interested in. Now, we want to generate a single dataframe with average daily price values for each of the stock and place them in a unique dataframe.

It seems that we don't have null values. Thus, we can proceed to the next step. Visualization!

3. Visualization of time series, moving average and Bollinger bands

In [11]:
ts_avg_prices = (ticker_history_open + ticker_history_close )/2

# to make it easier to call the specific stock, we apply the lower() to the column names
ts_avg_prices.columns = ts_avg_prices.columns.str.lower()

Some background knowledge on Bollinger bands can be found here. The concept is simple. Some stocks tend to follow what is called a mean reversal, they tend to fluctuate around their mean price. By applying confidence levels (standard deviation) around their mean, we can evaluate the upper and lower bands that represent some degree of probability for the price to fluctuate to.

In [12]:
ts_avg_prices.columns
Out[12]:
Index(['amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
       'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
       'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
       'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn'],
      dtype='object')
In [13]:
# generate 20 day rolling mean, the basis moving average for Bollinger bands
moving_avg = ts_avg_prices.apply(lambda x: x.rolling(20).mean())

# generate 20 day rolling standard deviation, to generate upper and lower bands
moving_std = ts_avg_prices.apply(lambda x: x.rolling(20).std())

# we merge all together and rename the columns accordingly to the order 
df_to_merge = [ts_avg_prices, moving_avg, moving_std]

df_final = reduce(lambda left,right: pd.merge(left,right,left_index = True, right_index = True), df_to_merge)

df_final.columns = [
    'amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
    'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
    'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
    'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn',
    'amzn_20d_mean', 'wba_20d_mean', 'noc_20d_mean', 'lmt_20d_mean', 'mcd_20d_mean', 'intc_20d_mean', 'nav_20d_mean', 'ibm_20d_mean', 'txn_20d_mean', 'ge_20d_mean',
    'symc_20d_mean', 'pep_20d_mean', 'ko_20d_mean', 'jnj_20d_mean', 'tm_20d_mean', 'hmc_20d_mean', 'sne_20d_mean', 'xom_20d_mean', 'cvx_20d_mean', 'vlo_20d_mean',
    'f_20d_mean', 'googl_20d_mean', 'aapl_20d_mean', 'msft_20d_mean', 'pfe_20d_mean', 'mrk_20d_mean', 'ba_20d_mean', 'cost_20d_mean', 'c_20d_mean', 'db_20d_mean',
    'nke_20d_mean', 'dis_20d_mean', 'cat_20d_mean', 'spy_20d_mean', 'lh_20d_mean', 'rcl_20d_mean', 'nnn_20d_mean',
    'amzn_20d_std', 'wba_20d_std', 'noc_20d_std', 'lmt_20d_std', 'mcd_20d_std', 'intc_20d_std', 'nav_20d_std', 'ibm_20d_std', 'txn_20d_std', 'ge_20d_std',
    'symc_20d_std', 'pep_20d_std', 'ko_20d_std', 'jnj_20d_std', 'tm_20d_std', 'hmc_20d_std', 'sne_20d_std', 'xom_20d_std', 'cvx_20d_std', 'vlo_20d_std',
    'f_20d_std', 'googl_20d_std', 'aapl_20d_std', 'msft_20d_std', 'pfe_20d_std', 'mrk_20d_std', 'ba_20d_std', 'cost_20d_std', 'c_20d_std', 'db_20d_std',
    'nke_20d_std', 'dis_20d_std', 'cat_20d_std', 'spy_20d_std', 'lh_20d_std', 'rcl_20d_std', 'nnn_20d_std'
    
                   
                   ]
In [37]:
# Create traces and generate Bollimger bands plot
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=df_final.index, 
        y=df_final['googl'],
        mode='lines',
        name='GOOGL',
        line = dict(
            color='#5F9EA0', 
            width=1.5, 
            dash='dash'
        ),
    )
)

fig.add_trace(
    go.Scatter(
        x=df_final.index, 
        y=df_final['googl_20d_mean'],
        mode='lines',
        name='20d_mean',
        line = dict(
            color='#1E90FF', 
            width=2, 
        )
    )
)
fig.add_trace(
    go.Scatter(
        x=df_final.index, 
        y=(df_final['googl_20d_mean'] + df_final['googl_20d_std']),
        mode='lines',
        name='upper band',
        line = dict(
            color='#1E90FF', 
            width=0.5, 
        )
    )
)
fig.add_trace(
    go.Scatter(
        x=df_final.index, 
        y=(df_final['googl_20d_mean'] - df_final['googl_20d_std']),
        mode='lines',
        name='lower band',
        fill='tonexty',
        line = dict(
            color='#1E90FF', 
            width=0.5, 
        )
    )
)

# Highlight subprime mortgage crisis (ref. https://en.wikipedia.org/wiki/Subprime_mortgage_crisis)
# Add shape regions
fig.update_layout(
    shapes=[
        # 1st highlight during Feb 4 - Feb 6
        dict(
            type="rect",
            xref="x",
            yref="paper",
            x0='2007-01-01',
            y0=0,
            x1='2011-01-01',
            y1=1,
            fillcolor="LightSalmon",
            opacity=0.5,
            layer="below",
            line_width=0,
            name = 'Subprime mortgage crisis'
        ),
        # 2nd highlight during Feb 20 - Feb 23
        dict(
            type="rect",
            xref="x",
            yref="paper",
            x0='2014-04-01',
            y0=0,
            x1='2014-05-01',
            y1=1,
            fillcolor="LightSalmon",
            opacity=0.5,
            layer="below",
            line_width=0,
        )
    ]
)

fig.add_annotation(
        xref="x",
        yref="paper",
        x='2009-01-01',
        y=0.9,
        text="Subprime mortgage crisis",
        showarrow=False,
        font=dict(
            family="Courier New, monospace",
            size=16,
            color="#ffffff"
            ),
        align="left",
        bordercolor="#c7c7c7",
        borderwidth=0,
        borderpad=0,
        bgcolor="#ff7f0e",
        opacity=0.8
        )

fig.add_annotation(
        xref="x",
        yref="paper",
        x='2014-01-01',
        y=0.95,
        text="Stock split",
        showarrow=False,
        font=dict(
            family="Courier New, monospace",
            size=16,
            color="#ffffff"
            ),
        align="left",
        bordercolor="#c7c7c7",
        borderwidth=0,
        borderpad=0,
        bgcolor="#ff7f0e",
        opacity=0.8
        )


# Prefix y-axis tick labels with dollar sign
fig.update_yaxes(tickprefix="$")

# Set figure title
fig.update_layout(title_text="Google Stock Mean price and Bollinger Bands")

# Adding a render to show the static image on github. Unfortunately I haven't found an easy hack to show the dynamic chart
fig.show()
In [15]:
# here we store the visualization in an html file that we will use to write the blog post
fig.write_html('google.html', include_plotlyjs = True)

We can generate a function so that plotting the remaining companies it's easier. For this project, I decided to focus on a simple modelling, thus I won't be spending a lot of time on Visualization and subplotting.

In [38]:
def plotting_bollinger(stock_name):
    '''
    This function takes the stock ticker and plots the mean stock price, 
    the 20d moving average and 20d Bollinger bands at 1 and 2 standard deviation
    1 std = 65% confidence interval; 2 std = 95% confidence interval
    '''
    # Create traces and generate Bollimger bands plot
    fig = go.Figure()
    fig.add_trace(
    go.Scatter(
        x=df_final.index, 
        y=df_final[stock_name],
        mode='lines',
        name= stock_name.upper(),
        line = dict(
            color='#5F9EA0', 
            width=1.5, 
            dash='dash'
            ),
        )
    )

    fig.add_trace(
        go.Scatter(
            x=df_final.index, 
            y=df_final[stock_name+'_20d_mean'],
            mode='lines',
            name='20d_mean',
            line = dict(
                color='#1E90FF', 
                width=2, 
            )
        )
    )
    fig.add_trace(
        go.Scatter(
            x=df_final.index, 
            y=(df_final[stock_name+'_20d_mean'] + df_final[stock_name+'_20d_std']),
            mode='lines',
            name='upper band',
            line = dict(
                color='#1E90FF', 
                width=0.5, 
            )
        )
    )
    fig.add_trace(
        go.Scatter(
            x=df_final.index, 
            y=(df_final[stock_name+'_20d_mean'] - df_final[stock_name+'_20d_std']),
            mode='lines',
            name='lower band',
            fill='tonexty',
            line = dict(
                color='#1E90FF', 
                width=0.5, 
            )
        )
    )

    # Highlight subprime mortgage crisis (ref. https://en.wikipedia.org/wiki/Subprime_mortgage_crisis)
    # Add shape regions
    fig.update_layout(
        shapes=[
            # 1st highlight during Feb 4 - Feb 6
            dict(
                type="rect",
                xref="x",
                yref="paper",
                x0='2007-01-01',
                y0=0,
                x1='2011-01-01',
                y1=1,
                fillcolor="LightSalmon",
                opacity=0.5,
                layer="below",
                line_width=0,
                name = 'Subprime mortgage crisis'
            ),
        ]
    )

    fig.add_annotation(
            xref="x",
            yref="paper",
            x='2009-01-01',
            y=0.9,
            text="Subprime mortgage crisis",
            showarrow=False,
            font=dict(
                family="Courier New, monospace",
                size=16,
                color="#ffffff"
                ),
            align="left",
            bordercolor="#c7c7c7",
            borderwidth=0,
            borderpad=0,
            bgcolor="#ff7f0e",
            opacity=0.8
            )

    # Prefix y-axis tick labels with dollar sign
    fig.update_yaxes(tickprefix="$")
    
    # Set figure title
    fig.update_layout(title_text= "Mean price and Bollinger Bands for " + stock_name )

    fig.show()
In [39]:
stock_list = ['googl', 'pfe', 'ba', 'cost','db', 'nke', 'nnn']

for i, val in enumerate(stock_list):
    if i<= len(stock_list):
        plotting_bollinger(val)

4. Clustering modelling

To evaluate wether the stocks targeted can be clustered together, we will use KMeans clustering. The initial approach will evaluate a simple model, built using all the datapoints (representing price movements between the start and end date selected).

Since there are huge price variations between the stocks we are examining, we will use a standardization step, using the Normalizer function available in scikit learn.

Some pre-processing steps before training our model are the evaluation of inertia and the silhouette analysis.

  • The inertia can suggest the optimal number of clusters to use. We can do this analysis by drawing a diagram of inertia versus cluster and evaluating the optimal parameter value through the so-called "elbow method".
  • The Silhouette analysis is applied when the true labels of the groups are not known, hence this is applicable in our case where we don't know, apriori, wether some stocks should belong to the same cluster.

Before fitting, we transpose the dataframe: each row will represent a stock ticker and each column a specific date.

[sh]:

In [18]:
# here we generate the feature we are going to use, price movements
price_movements_transposed = (ticker_history_open + ticker_history_close ).transpose()

price_movements_transposed.index = price_movements_transposed.index.rename('companies')

price_movements_transposed.head()
Out[18]:
date 2005-01-03 00:00:00+00:00 2005-01-04 00:00:00+00:00 2005-01-05 00:00:00+00:00 2005-01-06 00:00:00+00:00 2005-01-07 00:00:00+00:00 2005-01-10 00:00:00+00:00 2005-01-11 00:00:00+00:00 2005-01-12 00:00:00+00:00 2005-01-13 00:00:00+00:00 2005-01-14 00:00:00+00:00 ... 2017-12-15 00:00:00+00:00 2017-12-18 00:00:00+00:00 2017-12-19 00:00:00+00:00 2017-12-20 00:00:00+00:00 2017-12-21 00:00:00+00:00 2017-12-22 00:00:00+00:00 2017-12-26 00:00:00+00:00 2017-12-27 00:00:00+00:00 2017-12-28 00:00:00+00:00 2017-12-29 00:00:00+00:00
companies
AMZN 89.47 84.76 83.34 82.86 83.65 83.89 83.04 83.79 85.05 87.45 ... 2358.17 2377.95 2376.530 2368.12 2350.660 2340.440 2345.12 2362.17 2375.10 2351.82
WBA 80.13 80.44 81.21 82.19 82.69 83.37 83.31 83.68 83.20 83.02 ... 142.99 145.05 144.990 145.25 145.930 145.820 146.29 146.19 145.82 145.25
NOC 107.70 105.30 105.65 106.67 106.18 105.97 105.55 105.90 105.40 104.94 ... 618.97 617.71 615.980 614.59 611.220 608.290 608.26 610.81 615.11 615.16
LMT 109.61 107.27 107.92 109.14 109.28 109.70 109.85 110.75 111.10 111.50 ... 641.97 645.57 641.630 637.47 634.795 636.020 636.87 639.11 641.93 643.27
MCD 63.42 63.82 63.91 63.92 63.90 63.34 62.53 62.56 62.13 62.15 ... 347.83 349.19 347.875 345.72 344.530 342.795 342.38 344.00 346.36 345.10

5 rows × 3272 columns

In [19]:
# how many columns do we have? This equals to the number of attributes (or features) we have?
len(price_movements_transposed.columns)
Out[19]:
3272

4.1 Vanilla Clustering without dimensionality reduction

We have a lot of datapoints... One thing I'd like to do is try and test KMeans with all the price movements. Let's see what we get.

In [20]:
# Create a normalizer: normalizer
normalizer = Normalizer()

# normalize price movements
normalized_price_movements = normalizer.fit_transform(price_movements_transposed)

len(normalized_price_movements)
Out[20]:
37

One pre-processing step in clustering is to use the so-called 'elbow method' to decide how many 'k' we want to use for the clustering.

In [21]:
# deifine some useful functions

def inertia_plot(X, model, k_min = 2, k_max = 10):
    #A simple inertia plotter to decide K in KMeans
    inertia = []
    for x in range(k_min,k_max):
        clust = model(n_clusters = x)
        clust.fit(X)
        labels = clust.predict(X)
        inertia.append(clust.inertia_)
    plt.figure(figsize = (10,6))
    plt.plot(range(k_min,k_max), inertia, marker = 'o')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Inertia')
    plt.title('Inertia Decrease with K')
    plt.xticks(list(range(k_min, k_max)))
    plt.show()
    
def silh_samp_cluster(X, model, k_min = 2, k_max = 10, metric = 'euclidean', gauss=False):
    # Taken from sebastian Raschkas book Python Machine Learning second edition
    for x in range(k_min, k_max):
        clust = model(n_clusters = x)
        if gauss:
            km=clust.set_params(n_components=x)
            y_km = km.fit(X).predict(X)
        else:
            km=clust.set_params(n_clusters=x)
            y_km = km.fit_predict(X)
        cluster_labels = np.unique(y_km)
        n_clusters = cluster_labels.shape[0]
        silhouette_vals = silhouette_samples(X, y_km, metric = metric)
        y_ax_lower, y_ax_upper =0,0
        yticks = []
        for i, c in enumerate(cluster_labels):
            c_silhouette_vals = silhouette_vals[y_km == c]
            c_silhouette_vals.sort()
            y_ax_upper += len(c_silhouette_vals)
            color = cm.jet(float(i)/n_clusters)
            plt.barh(range(y_ax_lower, y_ax_upper),
                    c_silhouette_vals,
                    height=1.0,
                    edgecolor='none',
                    color = color)
            yticks.append((y_ax_lower + y_ax_upper)/2.)
            y_ax_lower+= len(c_silhouette_vals)
            
        silhouette_avg = np.mean(silhouette_vals)
        plt.axvline(silhouette_avg,
                   color = 'red',
                   linestyle = "--")
        plt.yticks(yticks, cluster_labels+1)
        plt.ylabel("cluster")
        plt.xlabel('Silhouette Coefficient')
        plt.title('Silhouette for ' + str(x) + " Clusters")
        plt.show()
        
def silh_scores(X, model, k_min = 2, k_max = 10, gauss=False):    
    for x in range(k_min, k_max):
        clust = model(n_clusters = x)
        if gauss:
            km = clust.set_params(n_components=x)
            label = km.fit(X).predict(X)
        else:
            km=clust.set_params(n_clusters=x)
            label = km.fit_predict(X)
        print('Silhouette-Score for', x,  'Clusters: ', silhouette_score(X, label))
In [22]:
inertia_plot(normalized_price_movements, KMeans,)

The chart represents the number of clusters and their inertia. By looking at the data, we are not able to identify a clear inflection point. We can use k = 4 or 5 to start.

In [23]:
silh_scores(normalized_price_movements, KMeans)
Silhouette-Score for 2 Clusters:  0.2696305964927277
Silhouette-Score for 3 Clusters:  0.23881038889775272
Silhouette-Score for 4 Clusters:  0.2469637110689243
Silhouette-Score for 5 Clusters:  0.24767577449414208
Silhouette-Score for 6 Clusters:  0.2342785501090901
Silhouette-Score for 7 Clusters:  0.20052644931167513
Silhouette-Score for 8 Clusters:  0.1652075538991459
Silhouette-Score for 9 Clusters:  0.16355867482021308
In [24]:
silh_samp_cluster(normalized_price_movements, KMeans)

Together, the inertia and the silhouette score suggest that we don't have a good performance so far. The low silhouette score suggest overlapping clusters (score ~= 0) and the high inertia and lack of clear elbow make it hard to identify a good threshold for the number of cluster.

However, there is an indication that a k between 2 and 4 could provide a good initial fit.

In [25]:
# producing the first model using k = 4 and number of iteration 1000. If convergence is reached, the algo stops before
kmeans = KMeans(
    n_clusters = 4,
    max_iter = 1000
)

kmeans.fit(normalized_price_movements)
Out[25]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=1000,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [26]:
# Predict the cluster labels: labels
clusters = kmeans.predict(normalized_price_movements)

# array containing all our companies
companies = ['amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
    'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
    'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
    'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn']

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': clusters, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values(by='labels'))
    labels companies
18       0       cvx
34       0        lh
33       0       spy
32       0       cat
30       0       nke
28       0         c
25       0       mrk
24       0       pfe
21       0     googl
20       0         f
19       0       vlo
17       0       xom
15       0       hmc
14       0        tm
36       0       nnn
1        0       wba
11       0       pep
10       0      symc
5        0      intc
13       0       jnj
7        0       ibm
12       1        ko
29       1        db
6        1       nav
9        1        ge
16       1       sne
8        2       txn
26       2        ba
27       2      cost
35       2       rcl
4        2       mcd
31       2       dis
3        2       lmt
2        2       noc
23       2      msft
0        2      amzn
22       3      aapl

4.1.2 Agglomerative Clustering

Before proceeding to dimensionality reduction, I want to try a different clustering algorithm, Agglomerative Clustering, which is part of the hierarchical clustering algortihms.

In [27]:
silh_scores(normalized_price_movements, AgglomerativeClustering)
silh_samp_cluster(normalized_price_movements, AgglomerativeClustering)
Silhouette-Score for 2 Clusters:  0.35731716454575524
Silhouette-Score for 3 Clusters:  0.21541514602482711
Silhouette-Score for 4 Clusters:  0.22581299600766014
Silhouette-Score for 5 Clusters:  0.2453694423306611
Silhouette-Score for 6 Clusters:  0.24282035066517826
Silhouette-Score for 7 Clusters:  0.19132487766387554
Silhouette-Score for 8 Clusters:  0.19193152641091496
Silhouette-Score for 9 Clusters:  0.19323150959432533

It seems that agglomerative clustering performs poorly in respect of KMeans.

4.2 Dimensionality reduction using PCA

There is a saying that too much information makes you blind. This is also true for machine learning algorithms that can perform poorly when we feed too much noisy and/or redundant data to our algorithm.

For this reason, we can use dimensionality reduction algorithms, such as PCA, to decrease the number of attributes and keep only those that explained the highest degree of variance in our dataset.

In [28]:
# here we plot the explained variance versus the number of attributes

plt.figure(figsize=(8, 8))
pca = PCA().fit(normalized_price_movements)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
plt.title('PCA Explained Variance')
plt.xticks(list(range(20)), list(range(20)))
plt.show()

This is a very interesting and promising visualization. It appears that 90% of the variance can be explained with 5 components. We might be able to improve the performance of our KMeans by looking at 5-7 components to explain 90-95% of the variability.

To do so, we repeat the preliminary analysis of the inertia and silhouette scores and then select the optimal number of clusters of the decomposed dataset.

In [29]:
norm_prices_pca = PCA(n_components=7).fit_transform(normalized_price_movements)
inertia_plot(norm_prices_pca, KMeans, k_max =20)

The inflection point is somewhere between 5 and 6 clusters.

In [30]:
silh_scores(normalized_price_movements, KMeans, k_min =3, k_max =8)
Silhouette-Score for 3 Clusters:  0.2433723893257129
Silhouette-Score for 4 Clusters:  0.2469637110689243
Silhouette-Score for 5 Clusters:  0.26002773782879746
Silhouette-Score for 6 Clusters:  0.20096689790183983
Silhouette-Score for 7 Clusters:  0.19282175458780132

The inertia and silhouette analysis indicate that 4 clusters would provide the optimal compromise to train our model.

In [31]:
# producing the first model using k = 4 and number of iteration 1000. If convergence is reached, the algo stops before
kmeans = KMeans(
    n_clusters = 4,
    max_iter = 1000
)

kmeans.fit(norm_prices_pca)

# Predict the cluster labels: labels
clusters = kmeans.predict(norm_prices_pca)

# array containing all our companies
companies = ['amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
    'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
    'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
    'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn']

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': clusters, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values(by='labels'))
    labels companies
30       0       nke
16       0       sne
15       0       hmc
6        0       nav
29       0        db
12       0        ko
9        0        ge
26       1        ba
35       1       rcl
27       1      cost
0        1      amzn
8        1       txn
4        1       mcd
3        1       lmt
2        1       noc
31       1       dis
23       1      msft
22       2      aapl
32       3       cat
28       3         c
33       3       spy
34       3        lh
25       3       mrk
24       3       pfe
18       3       cvx
20       3         f
19       3       vlo
17       3       xom
14       3        tm
13       3       jnj
11       3       pep
10       3      symc
7        3       ibm
5        3      intc
1        3       wba
21       3     googl
36       3       nnn

4.3 Dimensionality reduction using t-SNE

Now we want to reduce the number of features we are going to consider. We are going to use t-SNE to decompose the dataset and maintain only the datapoints that explain the most variance of the dataset.

In [32]:
# Create a t-SNE model with learning rate 50
tsne = TSNE(learning_rate = 50)

# Fit and transform the t-SNE model on the numeric dataset
tsne_features = tsne.fit_transform(normalized_price_movements)
print(tsne_features.shape)
(37, 2)
In [33]:
# Add the reduced features to the dataframe

price_movements_transposed['x'] = tsne_features[:,0]

price_movements_transposed['y'] = tsne_features[:,1]

# Scatter plot, coloring by variety_numbers
sns.scatterplot(x="x", y="y", data=price_movements_transposed)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x125af8f10>

Now that we have reduced the dimensionality of the data, we apply again our KMeans model to the reduced attributes.

In [34]:
# Fit the model
kmeans.fit(tsne_features)

# Predict the cluster labels: labels
clusters = kmeans.predict(tsne_features)

# array containing all our companies
companies = ['amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
    'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
    'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
    'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn']

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'clusters': clusters, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values(by='clusters'))
    clusters companies
0          0      amzn
2          0       noc
3          0       lmt
4          0       mcd
31         0       dis
8          0       txn
27         0      cost
26         0        ba
23         0      msft
35         0       rcl
34         1        lh
33         1       spy
28         1         c
25         1       mrk
24         1       pfe
19         1       vlo
36         1       nnn
10         1      symc
1          1       wba
14         1        tm
5          1      intc
13         1       jnj
11         1       pep
6          2       nav
12         2        ko
16         2       sne
9          2        ge
29         2        db
32         3       cat
30         3       nke
17         3       xom
22         3      aapl
21         3     googl
20         3         f
15         3       hmc
7          3       ibm
18         3       cvx
In [35]:
fig, ax = plt.subplots(figsize=(15, 7))

label = ['amzn', 'wba', 'noc', 'lmt', 'mcd', 'intc', 'nav', 'ibm', 'txn', 'ge',
    'symc', 'pep', 'ko', 'jnj', 'tm', 'hmc', 'sne', 'xom', 'cvx', 'vlo',
    'f', 'googl', 'aapl', 'msft', 'pfe', 'mrk', 'ba', 'cost', 'c', 'db',
    'nke', 'dis', 'cat', 'spy', 'lh', 'rcl', 'nnn'],

# Define step size of mesh
h = 0.01

# Plot the decision boundary
x_min,x_max = tsne_features[:,0].min()-1, tsne_features[:,0].max() + 1
y_min,y_max = tsne_features[:,1].min()-1, tsne_features[:,1].max() + 1
xx,yy = np.meshgrid(np.arange(x_min,x_max,h),np.arange(y_min,y_max,h))

# Obtain labels for each point in the mesh using our trained model
Z = kmeans.predict(np.c_[xx.ravel(),yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)

# Define color plot
cmap = plt.cm.Paired

# Plotting figure
plt.clf()

plt.imshow(
    Z,
    interpolation = 'nearest',
    extent=(xx.min(),xx.max(),yy.min(),yy.max()),
    cmap = cmap,
    aspect = 'auto',
    origin = 'lower')

plt.plot(
    tsne_features[:,0],
    tsne_features[:,1],
    'k.',
    markersize = 5)
# Plot the centroid of each cluster as a white X

    
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:,0],centroids[:,1],marker = 'x',s = 169,linewidths = 3,color = 'w',zorder = 10)
plt.title('K-Means clustering on stock market movements (t_SNE-Reduced data)')
plt.xlim(x_min,x_max)
plt.ylim(y_min,y_max)
plt.show()

Conclusion

This notebook represents the attempt to cluster stocks, from different industry verticals and sectors, together based on their price movements. Despite the low degree of separation between the clusters, we can consider this experience a success.

Moreover, this is a good starting points for future clustering projects.

What to improve?

I only clustered the stocks based on daily price movements. This might result in grouping the stock by major trends in the market rather than a real correlation between specific stocks.

Additionally, the initial clustering is made using the full time window. Thus, there is an implicit assumption that the stocks should have moved reasonably together during the span of many years. This is probably not a very realistic assumption, thus it would be better to select shorter timeframes (e.g. quarterly) and repeat the analysis.

Furthermore, we could have enhanched our dataset with additional attributes, not only price movements. For instance, good indicators of stock trends are bid/asks volumes, volatility and additional financial indicators generally used in technical analysis (e.g. relative strenght indicator, average true range, P/E, dividend, etc. )

In [ ]: